The goals / steps of this project are the following:
# imports
import numpy as np
import cv2
import glob
import matplotlib.pyplot as plt
import pickle
from time import time
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from skimage.feature import hog
%matplotlib inline
The feature vector combines HOG features, a color histogram, and binned color features. As hinted in the lecture, I converted the image into different color spaces, including RGB (original), HSV, HLS, and YCrCb, and tried all channels as well as individual channels for each conversion. Using the Hue channel of HSV performs well, as expected, since it can somewhat differentiate vehicles from the background; this feature alone reaches an accuracy of 0.93. Using all channels of LUV eventually stands out in training accuracy. The color histogram and binned color features both help improve accuracy, and the combination I picked eventually hits an accuracy of 0.989. Using more bins would push it above 0.99; however, more channels or bins dramatically slow down the processing of each image. As a result, the time to process one video might stretch from 1 h to 3 h, and training also takes longer with a higher-dimensional feature vector.
def extract_features(img, quiet = True):
    result = []
    # Compute binned color features
    bin_feature = cv2.resize(img, (16, 16)).ravel()
    result.append(bin_feature)
    # Compute color histogram
    color_channels = []
    for channel in range(img.shape[2]):
        color_channels.append(np.histogram(img[:,:,channel], bins=32, range=(0, 256))[0])
    color_hist = np.hstack(color_channels)
    result.append(color_hist)
    # Compute HOG features
    pix_per_cell = 8
    cell_per_block = 1
    orient = 8
    # Convert color space before extracting HOG features.
    # Note: cv2.imread returns BGR, so COLOR_RGB2YUV is applied to BGR data here;
    # this stays consistent because training and prediction use the same convention.
    converted = cv2.cvtColor(img, cv2.COLOR_RGB2YUV)
    hog_feature = []
    hog_images = []
    for channel in range(converted.shape[2]):
        feature, hog_image = hog(converted[:,:,channel], orientations=orient,
                                 pixels_per_cell=(pix_per_cell, pix_per_cell),
                                 cells_per_block=(cell_per_block, cell_per_block),
                                 visualise=True, feature_vector=False)
        hog_feature.append(np.ravel(feature))
        hog_images.append(hog_image)
    hog_feature = np.ravel(hog_feature)
    result.append(hog_feature)
    if quiet == False:
        print(np.shape(bin_feature))
        print(np.shape(color_hist))
        print(np.shape(hog_feature))
        print(np.shape(result))
        # Visualize the HOG feature images
        f, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8)) = plt.subplots(4, 2, figsize=(20,10))
        ax1.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        ax1.set_title('Original Image', fontsize=24)
        ax2.imshow(converted)
        ax2.set_title('Converted Image', fontsize=24)
        ax3.imshow(converted[:,:,0], cmap='gray')
        ax3.set_title('Image Channel 0', fontsize=24)
        ax4.imshow(hog_images[0], cmap='gray')
        ax4.set_title('HOG Feature Image Channel 0', fontsize=24)
        ax5.imshow(converted[:,:,1], cmap='gray')
        ax5.set_title('Image Channel 1', fontsize=24)
        ax6.imshow(hog_images[1], cmap='gray')
        ax6.set_title('HOG Feature Image Channel 1', fontsize=24)
        ax7.imshow(converted[:,:,2], cmap='gray')
        ax7.set_title('Image Channel 2', fontsize=24)
        ax8.imshow(hog_images[2], cmap='gray')
        ax8.set_title('HOG Feature Image Channel 2', fontsize=24)
    result = np.concatenate(result)
    return result
img = cv2.imread('./test_images/image_kitti.png')
feature = extract_features(img, quiet = False)
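As a back-of-the-envelope check on the dimensionality (an illustration, assuming a 64×64 RGB input and one HOG vector per channel): the binned color feature contributes 16×16×3 values, the histogram 32 bins × 3 channels, and HOG with 8×8-pixel cells, 1 cell per block, and 8 orientations contributes 8×8×8 values per channel.

```python
# Expected length of the combined feature vector for a 64x64 RGB input
# (an illustrative check only, not part of the pipeline).
bin_len = 16 * 16 * 3                    # spatially binned color features
hist_len = 32 * 3                        # 32-bin histogram per channel
cells = 64 // 8                          # 8 cells per side with 8x8-pixel cells
hog_len = cells * cells * 1 * 1 * 8 * 3  # blocks x cells/block x orientations x channels
total = bin_len + hist_len + hog_len
print(bin_len, hist_len, hog_len, total)  # 768 96 1536 2400
```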
Read all data from the provided 'vehicles' and 'non-vehicles' folders. It takes more than 5 minutes to extract features from the 8k+ training images in each folder, so I extract them once and store the labeled features in a local pickle file.
vehicle_features = []
non_vehicle_features = []
t1 = time()
for img_path in glob.glob('../vehicles/*/*.png'):
    img = cv2.imread(img_path)
    vehicle_features.append(extract_features(img))
print(len(vehicle_features))
print("Loading labeled vehicle time: ", round(time()-t1, 3), "s")
t1 = time()
for img_path in glob.glob('../non-vehicles/*/*.png'):
    img = cv2.imread(img_path)
    non_vehicle_features.append(extract_features(img))
print(len(non_vehicle_features))
print("Loading labeled non-vehicle time: ", round(time()-t1, 3), "s")
# Save the extracted features to local
pickle_file = {}
pickle_file["vehicle_features"] = vehicle_features
pickle_file["non_vehicle_features"] = non_vehicle_features
pickle.dump(pickle_file, open( "./gray_features.p", "wb" ))
Load the features stored in the local pickle. The features are normalized and split into training and test sets, and the training portion is fed into an SVM classifier. I used the LinearSVC classifier from the lecture script. The accuracy seems sufficient for vehicle detection, and the training speed is impressive: 80% of 17K samples is still over 13K, yet training only took about 1.6 s.
file = open("./gray_features.p",'rb')
features = pickle.load(file)
vehicle_features = features["vehicle_features"]
print(len(vehicle_features))
non_vehicle_features = features["non_vehicle_features"]
print(len(non_vehicle_features))
print(np.shape(vehicle_features[0]))
X = np.vstack((vehicle_features, non_vehicle_features)).astype(np.float64)
X_scaler = StandardScaler().fit(X)
scaled_X = X_scaler.transform(X)
y = np.hstack((np.ones(len(vehicle_features)), np.zeros(len(non_vehicle_features))))
# Split up data into randomized training and test sets
rs = np.random.randint(0, 100)
X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.2, random_state=rs)
svc = LinearSVC(loss='hinge') # Use a linear SVC
t1 = time()
svc.fit(X_train, y_train) # Train the classifier
print("Training time: ", round(time()-t1, 3), "s")
print('Test Accuracy of SVC = ', round(svc.score(X_test, y_test), 4)) # Check the score of the SVC
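One caveat worth noting (my own observation, not part of the lecture script): fitting the StandardScaler on all of X before splitting lets test-set statistics influence the scaling. A minimal sketch of the leak-free ordering, on synthetic stand-in data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix
X_demo = np.random.RandomState(0).rand(100, 5)
y_demo = np.concatenate((np.ones(50), np.zeros(50)))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_tr)   # fit on the training split only
X_tr_s = scaler.transform(X_tr)       # then transform both splits with the same scaler
X_te_s = scaler.transform(X_te)
print(X_tr_s.mean(axis=0).round(6))   # ~0 per feature on the training split
```

In practice the difference here is small, but it keeps the reported test accuracy honest.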
Since we use the same video as the previous project, we need to remove image distortion from each frame. The pickle file containing the camera matrix and distortion coefficients from the previous calibration is reused here.
# load dist pickle that contains the camera matrix and distortion coefficients
def read_mtx_dist():
    file = open("./wide_dist_pickle.p", 'rb')
    dist_pickle = pickle.load(file)
    file.close()
    mtx = dist_pickle["mtx"]
    dist = dist_pickle["dist"]
    return mtx, dist
def remove_distortion(img, mtx, dist, quiet = True):
    dst = cv2.undistort(img, mtx, dist, None, mtx)
    #cv2.imwrite('../camera_cal/test_undist.jpg', dst)
    if quiet == False:
        # Visualize undistortion
        f, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,10))
        ax1.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        ax1.set_title('Original Image', fontsize=24)
        ax2.imshow(cv2.cvtColor(dst, cv2.COLOR_BGR2RGB))
        ax2.set_title('Undistorted Image', fontsize=24)
    return dst
mtx, dist = read_mtx_dist()
# Test undistortion on an image
test_cali_img = cv2.imread('./calibration1.jpg')
undistort = remove_distortion(test_cali_img, mtx, dist, False)
Here we define a sliding-window function slide_window to generate a list of boxes with predefined parameters, and a draw_boxes function to draw that list of boxes on an image. I used three types of window with different sizes, corresponding to the distance from the observing car. Note that the areas overlap, since it is impossible to tell the size of a vehicle simply from its location in the image. A larger 'xy_overlap' tends to give much better results, since the window scan is denser and is more likely to detect the target and form a heat spot above the threshold. However, it also requires more computing power and time.
# Here is your draw_boxes function from the previous exercise
def draw_boxes(img, bboxes, color=(0, 0, 255), thick=6):
    # Make a copy of the image
    imcopy = np.copy(img)
    # Iterate through the bounding boxes
    for bbox in bboxes:
        # Draw a rectangle given bbox coordinates
        cv2.rectangle(imcopy, bbox[0], bbox[1], color, thick)
    # Return the image copy with boxes drawn
    return imcopy
# Define a function that takes an image,
# start and stop positions in both x and y,
# window size (x and y dimensions),
# and overlap fraction (for both x and y)
def slide_window(img, x_start_stop=[None, None], y_start_stop=[None, None],
                 xy_window=(64, 64), xy_overlap=(0.5, 0.5)):
    # If x and/or y start/stop positions not defined, set to image size
    if x_start_stop[0] == None:
        x_start_stop[0] = 0
    if x_start_stop[1] == None:
        x_start_stop[1] = img.shape[1]
    if y_start_stop[0] == None:
        y_start_stop[0] = 0
    if y_start_stop[1] == None:
        y_start_stop[1] = img.shape[0]
    # Compute the span of the region to be searched
    xspan = x_start_stop[1] - x_start_stop[0]
    yspan = y_start_stop[1] - y_start_stop[0]
    # Compute the number of pixels per step in x/y
    nx_pix_per_step = np.int(xy_window[0]*(1 - xy_overlap[0]))
    ny_pix_per_step = np.int(xy_window[1]*(1 - xy_overlap[1]))
    # Compute the number of windows in x/y
    nx_buffer = np.int(xy_window[0]*(xy_overlap[0]))
    ny_buffer = np.int(xy_window[1]*(xy_overlap[1]))
    nx_windows = np.int((xspan-nx_buffer)/nx_pix_per_step)
    ny_windows = np.int((yspan-ny_buffer)/ny_pix_per_step)
    # Initialize a list to append window positions to
    window_list = []
    # Loop through finding x and y window positions
    # Note: you could vectorize this step, but in practice
    # you'll be considering windows one by one with your
    # classifier, so looping makes sense
    for ys in range(ny_windows):
        for xs in range(nx_windows):
            # Calculate window position
            startx = xs*nx_pix_per_step + x_start_stop[0]
            endx = startx + xy_window[0]
            starty = ys*ny_pix_per_step + y_start_stop[0]
            endy = starty + xy_window[1]
            # Append window position to list
            window_list.append(((startx, starty), (endx, endy)))
    # Return the list of windows
    return window_list
img = cv2.imread('./test_images/test1.jpg')
windows1 = slide_window(img, x_start_stop=[0, 320], y_start_stop=[400, 640],
xy_window=(140, 140), xy_overlap=(0.78, 0.78))
windows2 = slide_window(img, x_start_stop=[280, 500], y_start_stop=[400, 560],
xy_window=(80, 80), xy_overlap=(0.75, 0.75))
windows3 = slide_window(img, x_start_stop=[500, 780], y_start_stop=[400, 500],
xy_window=(100, 100), xy_overlap=(0.75, 0.75))
windows4 = slide_window(img, x_start_stop=[780, 1000], y_start_stop=[400, 560],
xy_window=(80, 80), xy_overlap=(0.75, 0.75))
windows5 = slide_window(img, x_start_stop=[960, 1280], y_start_stop=[400, 640],
xy_window=(140, 140), xy_overlap=(0.78, 0.78))
windows = windows1 + windows2 + windows3 + windows4 + windows5
for img_path in glob.glob('./test_images/test*.*'):
    img = cv2.imread(img_path)
    img = remove_distortion(img, mtx, dist, True)
    window_img = draw_boxes(img, windows, color=(0, 0, 255), thick=6)
    plt.figure()
    plt.imshow(cv2.cvtColor(window_img, cv2.COLOR_BGR2RGB))
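The step/buffer arithmetic inside slide_window can be reproduced standalone as a sanity check on the window counts (an illustration; the values below use the 80 px, 0.75-overlap group over x ∈ [780, 1000], y ∈ [400, 560]):

```python
def count_windows(span, window, overlap):
    # Mirrors the per-axis arithmetic inside slide_window
    step = int(window * (1 - overlap))   # pixels advanced per step
    buffer = int(window * overlap)       # margin so windows stay inside the span
    return (span - buffer) // step

nx = count_windows(1000 - 780, 80, 0.75)  # x direction
ny = count_windows(560 - 400, 80, 0.75)   # y direction
print(nx, ny, nx * ny)  # 8 5 40
```

With a step of only 20 px, even this small region already costs 40 classifier evaluations per frame, which is where the overlap/speed trade-off discussed above comes from.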
Each window defined by slide_window() is passed to the trained classifier and labeled 'vehicle' or 'non-vehicle'. Note that, to save final video processing time, I removed the two window groups on the left side; I tested those areas and there were minimal false alarms for the given video.
windows1 = slide_window(img, x_start_stop=[600, 800], y_start_stop=[400, 500],
xy_window=(100, 100), xy_overlap=(0.8, 0.8))
windows2 = slide_window(img, x_start_stop=[780, 1000], y_start_stop=[400, 560],
xy_window=(80, 80), xy_overlap=(0.8, 0.8))
windows3 = slide_window(img, x_start_stop=[960, 1280], y_start_stop=[400, 640],
xy_window=(140, 140), xy_overlap=(0.78, 0.78))
windows = windows1 + windows2 + windows3
def search_windows(img, windows, clf, scaler):
    #1) Create an empty list to receive positive detection windows
    on_windows = []
    #2) Iterate over all windows in the list
    for window in windows:
        #3) Extract the test window from the original image
        test_img = cv2.resize(img[window[0][1]:window[1][1], window[0][0]:window[1][0]], (64, 64))
        #4) Extract features for that window using extract_features()
        features = extract_features(test_img)
        #5) Scale extracted features to be fed to the classifier
        test_features = scaler.transform(np.array(features).reshape(1, -1))
        #6) Predict using your classifier
        prediction = clf.predict(test_features)
        #7) If positive (prediction == 1) then save the window
        if prediction == 1:
            on_windows.append(window)
    #8) Return windows for positive detections
    return on_windows
for img_path in glob.glob('./test_images/test*.*'):
    img = cv2.imread(img_path)
    img = remove_distortion(img, mtx, dist, True)
    hot_windows = search_windows(img, windows, svc, X_scaler)
    window_img = draw_boxes(img, hot_windows, color=(0, 0, 255), thick=6)
    plt.figure()
    plt.imshow(cv2.cvtColor(window_img, cv2.COLOR_BGR2RGB))
Note in the result above that the windows overlap, and many adjacent windows may all detect the same vehicle. We need a way to form a unified bounding box for each detected target. The heat-map method accumulates a count over the overlapping areas, and only areas above a given threshold count as valid detections. I pass a threshold of 1 to apply_threshold, which requires at least two overlapping detections. This effectively eliminates the majority of false alarms while keeping the real targets.
from scipy.ndimage.measurements import label
def add_heat(heatmap, bbox_list):
    # Iterate through list of bboxes
    for box in bbox_list:
        # Add += 1 for all pixels inside each bbox
        # Assuming each "box" takes the form ((x1, y1), (x2, y2))
        heatmap[box[0][1]:box[1][1], box[0][0]:box[1][0]] += 1
    # Return updated heatmap
    return heatmap

def apply_threshold(heatmap, threshold):
    # Zero out pixels below the threshold
    heatmap[heatmap <= threshold] = 0
    # Return thresholded map
    return heatmap
def draw_labeled_bboxes(img, labels):
    # Iterate through all detected cars
    for car_number in range(1, labels[1]+1):
        # Find pixels with each car_number label value
        nonzero = (labels[0] == car_number).nonzero()
        # Identify x and y values of those pixels
        nonzeroy = np.array(nonzero[0])
        nonzerox = np.array(nonzero[1])
        # Define a bounding box based on min/max x and y
        bbox = ((np.min(nonzerox), np.min(nonzeroy)), (np.max(nonzerox), np.max(nonzeroy)))
        # Draw the box on the image
        cv2.rectangle(img, bbox[0], bbox[1], (0,0,255), 6)
    # Return the image
    return img
for img_path in glob.glob('./test_images/test*.*'):
    img = cv2.imread(img_path)
    img = remove_distortion(img, mtx, dist, True)
    hot_windows = search_windows(img, windows, svc, X_scaler)
    window_img = draw_boxes(img, hot_windows, color=(0, 0, 255), thick=6)
    plt.figure()
    heat = np.zeros_like(img[:,:,0]).astype(np.float)
    # Add heat to each box in box list
    heat = add_heat(heat, hot_windows)
    # Apply threshold to help remove false positives
    heat = apply_threshold(heat, 1)
    # Visualize the heatmap when displaying
    heatmap = np.clip(heat, 0, 255)
    labels = label(heatmap)
    draw_img = draw_labeled_bboxes(np.copy(img), labels)
    f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20,10))
    ax1.imshow(cv2.cvtColor(window_img, cv2.COLOR_BGR2RGB))
    ax1.set_title('Car Positions: Multiple Positions', fontsize=24)
    ax2.imshow(cv2.cvtColor(draw_img, cv2.COLOR_BGR2RGB))
    ax2.set_title('Car Positions', fontsize=24)
    ax3.imshow(heatmap, cmap='hot')
    ax3.set_title('Heat Map', fontsize=24)
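The effect of thresholding is easy to see on a toy heat map (a standalone re-implementation of the add_heat / apply_threshold logic for illustration): two boxes overlap on a small region, and with threshold 1 only the doubly covered pixels survive.

```python
import numpy as np

heat = np.zeros((10, 10))
boxes = [((1, 1), (6, 6)), ((4, 4), (9, 9))]  # ((x1, y1), (x2, y2))
for (x1, y1), (x2, y2) in boxes:
    heat[y1:y2, x1:x2] += 1    # same accumulation as add_heat
heat[heat <= 1] = 0            # same rule as apply_threshold(heat, 1)
print(int((heat > 0).sum()))   # 4 -- only the 2x2 overlap of the two boxes remains
```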
The final pipeline contains distortion removal, window search, heat-map thresholding, and labeling. Although the heat map looked valid on individual frames in the previous section, many false positives appeared in the processed video, and the correct tracking boxes also looked flaky. A straightforward improvement is to use a deque to store a history of heat maps: only the heat maps of the most recent 10 frames are kept, and a position is considered a true detection only when its accumulated heat value exceeds a threshold. This greatly mitigates false positives and also renders much smoother tracking boxes.
from collections import deque
n_frames = 10
heatmaps = deque(maxlen = n_frames)
def pipeline(img):
    img = remove_distortion(img, mtx, dist, True)
    hot_windows = search_windows(img, windows, svc, X_scaler)
    heat = np.zeros_like(img[:,:,0]).astype(np.float)
    # Add heat to each box in box list
    heat = add_heat(heat, hot_windows)
    # Store the heat map of the most recent frames
    heatmaps.append(heat)
    # Sum the history and apply the accumulated threshold
    threshold = 11
    combined = sum(heatmaps)
    combined = apply_threshold(combined, threshold)
    combined = np.clip(combined, 0, 255)
    labels = label(combined)
    draw_img = draw_labeled_bboxes(np.copy(img), labels)
    return draw_img
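The deque's maxlen behavior is what makes the history window roll (a toy demo, independent of the pipeline): once more than n_frames heat maps have been appended, the oldest drops out, so the sum always covers at most the last 10 frames.

```python
import numpy as np
from collections import deque

history = deque(maxlen=10)
for frame in range(15):          # append 15 tiny "heat maps"
    history.append(np.ones((2, 2)))
combined = sum(history)          # only the last 10 survive
print(len(history), combined[0, 0])  # 10 10.0
```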
Call the pipeline function on each frame of the video and compose a new video.
from moviepy.editor import VideoFileClip
output = './project_video_output.mp4'
clip1 = VideoFileClip("./project_video.mp4").subclip(5,50)
white_clip = clip1.fl_image(pipeline)
%time white_clip.write_videofile(output, audio=False)
from IPython.display import HTML
HTML("""
<video width="640" height="360" controls>
<source src="{0}">
</video>
""".format('./project_video_output.mp4'))
From the video result we can see that both the white and the black car are precisely captured in most frames, so the feature extraction is not biased by color. However, shadowed areas tend to give false alarms, which suggests the LUV-based features are vulnerable to lighting changes. A combination of more color spaces and channels might be more robust, but it would also demand more resources and could further harm the processing speed; it is vital to strike a balance between the two. Cars are captured both when they are close and as they move further away, which means the window-size division based on distance is valid. I only used 3 groups of windows with small overlapping areas; more windows with different sizes and larger overlap would surely give better results. In that case, the heat-map threshold might need to be raised, since the same area would be examined many more times. Note that the heat-map method can hardly provide bounding boxes of consistent size; this is inherent to the method, and it is possible to post-process the detected areas for more precise results. Another possibility is to apply the lane-detection technique from the previous project to estimate the distance of each area and then better choose the search-window size for a given location. This would potentially give more precise predictions and save computing power. A typical use case: for the given video, there would obviously be no vehicles to the left of the yellow line, so there is no need to place windows there.